LinkedIn Post Analysis

A Year in Data

NLP + ggplot study of my LinkedIn posts and engagement patterns.
Author

Satkar Karki

Published

August 15, 2025

Keywords

linkedin, data engineering, nlp, ggplot, carousel

1 Overview: LinkedIn Post Analysis

This project applies Natural Language Processing (NLP) techniques to analyze my LinkedIn posts from the past year. The primary goal is to combine data collection, text pre-processing, and statistical analysis to uncover how my word usage and content patterns have evolved.

The workflow includes:

  • Data acquisition – retrieving my LinkedIn posts and associated metadata.

  • Post type distribution – plotting the frequency of different post categories over time.

  • Text pre-processing – tokenizing text, removing stop words, filtering out INSTANT_SHARE reposts, stripping links, and producing a tidy dataset with one token per row.

  • Word frequency analysis – calculating unigram and bigram counts.

  • Change-over-time analysis – measuring shifts in word usage using slope calculations for both unigrams and bigrams.

This end-to-end process demonstrates how NLP can be applied to personal social media data for content strategy insights and longitudinal language trends.

2 Loading the Data
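The acquisition step itself is not shown; a minimal loading sketch, assuming the posts were exported to a flat file (the name `linkedin_posts.csv` is hypothetical) with the four columns previewed below:

```r
library(readr)
library(dplyr)

# Hypothetical export file; the actual acquisition step is not part of this write-up.
posts <- read_csv("linkedin_posts.csv",
                  col_types = cols(created_at = col_datetime())) |>
  arrange(desc(created_at))  # newest first, matching the preview below

head(posts)
```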

# A tibble: 6 × 4
  urn                                 text             created_at          type 
  <chr>                               <chr>            <dttm>              <chr>
1 urn:li:activity:7361477882038099968 "3 𝐰𝐚𝐲𝐬 𝐈 𝐢𝐧𝐯𝐞𝐬… 2025-08-13 14:24:56 TEXT 
2 urn:li:activity:7360783785690374145 "Last semester,… 2025-08-11 16:26:51 CELE…
3 urn:li:activity:7360163662294048769  <NA>            2025-08-09 23:22:42 INST…
4 urn:li:activity:7359412088483504130 "🎯 Graduating S… 2025-08-07 21:36:13 UNKN…
5 urn:li:activity:7359252038888669188 "One habit that… 2025-08-07 11:00:14 TEXT 
6 urn:li:activity:7358345640961024000 "I just pulled … 2025-08-04 22:58:32 IMAGE


2.1 Post Type Distribution

Let’s look at the distribution of the posts made over time.
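A sketch of the distribution plot, assuming the `posts` tibble from the loading step:

```r
library(dplyr)
library(ggplot2)

# Count posts per type and order the bars by frequency.
posts |>
  count(type, sort = TRUE) |>
  ggplot(aes(x = n, y = reorder(type, n))) +
  geom_col() +
  labs(x = "Number of posts", y = NULL,
       title = "Distribution of post types")
```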

3 Post Types Over Time

Let’s track how different post types evolved over time.

I kicked off my LinkedIn journey by sharing the certificates I was earning from Coursera, edX, LinkedIn Learning, and Udemy. In January 2025 I made a simple promise to post consistently. After finishing a Personal Selling course I stopped leaning on quick shares and started writing my own updates. By March the shift was obvious. Images, text posts, and document carousels became the core of my feed, while instant reposts faded into the background. Over the summer I added video to widen the story I can tell. The result is a steady move from announcements to original content that teaches, documents, and shows my work. Next, I will keep the cadence, scale the formats that land, and use video more often to bring projects to life.

4 Text Preprocessing Pipeline

This section implements a comprehensive text preprocessing pipeline designed specifically for LinkedIn post analysis. The preprocessing steps are carefully sequenced to transform raw social media text into clean, analyzable tokens while preserving meaningful content and removing noise.

4.1 Preprocessing Strategy Overview

1. Data Filtering: We begin by filtering out INSTANT_SHARE posts, which are typically reposts or shares that don’t contain original content. This ensures our analysis focuses on authentic user-generated content.

2. Link Removal: External links are identified and removed using regex patterns that match common URL formats (http/https, www, linkedin.com, etc.). This prevents links from being treated as meaningful text tokens while preserving the semantic content of the posts.

3. Text Normalization: The text undergoes several normalization steps including conversion to lowercase, removal of extra whitespace, and handling of special characters to ensure consistent tokenization.

4. Tokenization: Text is broken down into individual words using the unnest_tokens() function, which handles punctuation, contractions, and word boundaries appropriately for social media text.

5. Stopword Removal: Common English stopwords are removed to focus analysis on content-bearing words. This includes articles, prepositions, common verbs, and other function words that don’t contribute to topic analysis.

6. Hashtag Preservation: Hashtags are intentionally preserved as they often contain valuable topic and sentiment information specific to social media content.

7. Quality Filtering: Final filtering removes very short tokens (less than 3 characters) and ensures we have meaningful content for analysis.

4.2 Implementation Details

The preprocessing pipeline is implemented using tidytext principles, ensuring that each step produces a clean, structured dataset ready for downstream analysis. The pipeline maintains data integrity by preserving post metadata (timestamps, engagement metrics) while transforming only the text content.
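A condensed sketch of steps 1–7, assuming the `posts` tibble from Section 2. Note that `unnest_tokens()` strips the `#` character by default, so true hashtag preservation (step 6) needs extra handling not shown here:

```r
library(dplyr)
library(stringr)
library(tidytext)

data("stop_words")  # tidytext's English stopword lexicon

tidy_words <- posts |>
  filter(type != "INSTANT_SHARE", !is.na(text)) |>              # step 1: drop reposts
  mutate(text = str_remove_all(
    text, "(https?://|www\\.)\\S+|\\S*linkedin\\.com\\S*")) |>  # step 2: strip links
  unnest_tokens(word, text) |>                                  # steps 3-4: lowercase + tokenize
  anti_join(stop_words, by = "word") |>                         # step 5: remove stopwords
  filter(str_length(word) >= 3)                                 # step 7: drop short tokens
```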

Original posts: 72 
Posts after filtering INSTANT_SHARE: 65 
Posts removed: 7 

Total tokens after preprocessing: 3460 
Unique words: 1699 
Average words per post: 53.2 
# A tibble: 20 × 3
   created_at          word      type 
   <dttm>              <chr>     <chr>
 1 2025-08-13 14:24:56 𝐰𝐚𝐲𝐬      TEXT 
 2 2025-08-13 14:24:56 𝐢𝐧𝐯𝐞𝐬𝐭𝐞𝐝  TEXT 
 3 2025-08-13 14:24:56 𝐠𝐫𝐨𝐰𝐭𝐡    TEXT 
 4 2025-08-13 14:24:56 𝐝𝐚𝐭𝐚      TEXT 
 5 2025-08-13 14:24:56 𝐚𝐧𝐚𝐥𝐲𝐬𝐭   TEXT 
 6 2025-08-13 14:24:56 𝐈𝐧𝐯𝐞𝐬𝐭𝐢𝐧𝐠 TEXT 
 7 2025-08-13 14:24:56 𝐬𝐞𝐜𝐨𝐧𝐝    TEXT 
 8 2025-08-13 14:24:56 𝐬𝐜𝐫𝐞𝐞𝐧    TEXT 
 9 2025-08-13 14:24:56 semester  TEXT 
10 2025-08-13 14:24:56 msba      TEXT 
11 2025-08-13 14:24:56 program   TEXT 
12 2025-08-13 14:24:56 chad      TEXT 
13 2025-08-13 14:24:56 birger    TEXT 
14 2025-08-13 14:24:56 piece     TEXT 
15 2025-08-13 14:24:56 advice    TEXT 
16 2025-08-13 14:24:56 forget    TEXT 
17 2025-08-13 14:24:56 monitor   TEXT 
18 2025-08-13 14:24:56 message   TEXT 
19 2025-08-13 14:24:56 time      TEXT 
20 2025-08-13 14:24:56 saved     TEXT 

5 Word Frequency Analysis

Now I’ll analyze the most frequent words in my LinkedIn posts to understand my content patterns and key themes.

5.1 Word Frequency Calculation

I’ll calculate the frequency of each word across all my posts, which will help identify my most commonly used terms and topics.
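The frequency table below reduces to a count over the tidy tokens (assuming the `tidy_words` frame from Section 4):

```r
library(dplyr)

word_freq <- tidy_words |>
  count(word, sort = TRUE) |>
  mutate(total_words = sum(n),            # overall token count (3460 here)
         frequency   = n / total_words)   # share of all tokens

head(word_freq, 20)
```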

# A tibble: 20 × 4
   word            n total_words frequency
   <chr>       <int>       <int>     <dbl>
 1 data           84        3460   0.0243 
 2 analytics      32        3460   0.00925
 3 business       27        3460   0.00780
 4 learning       26        3460   0.00751
 5 real           26        3460   0.00751
 6 i’m            24        3460   0.00694
 7 project        24        3460   0.00694
 8 i’ve           21        3460   0.00607
 9 time           18        3460   0.00520
10 engineering    17        3460   0.00491
11 building       16        3460   0.00462
12 sql            16        3460   0.00462
13 skills         15        3460   0.00434
14 spark          15        3460   0.00434
15 bootcamp       14        3460   0.00405
16 built          14        3460   0.00405
17 experience     14        3460   0.00405
18 it’s           14        3460   0.00405
19 journey        14        3460   0.00405
20 excited        13        3460   0.00376

5.2 Word Frequency Visualization

I’ll create a visualization to show my most frequent words and their relative frequencies.
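One way to draw it, assuming the `word_freq` table built above:

```r
library(dplyr)
library(ggplot2)

word_freq |>
  slice_max(n, n = 20) |>
  ggplot(aes(x = frequency, y = reorder(word, frequency))) +
  geom_col() +
  labs(x = "Relative frequency", y = NULL,
       title = "Top 20 words in my LinkedIn posts")
```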

My language clusters around three things. First, the core theme is clear with “data,” “analytics,” and “business” leading the list. Second, the frequent “I’m” and “I’ve” signals a first-person voice that tells a personal story rather than posting generic updates. Third, tool names like SQL and Spark show that I share hands-on work. Words such as “project,” “learning,” “building,” and “journey” reinforce a build-in-public approach where I document progress and lessons for others.

6 Bigram Analysis

Since I use many professional phrases like “data analyst”, “business analytics”, and “data engineering”, I’ll analyze bigrams to capture these important two-word combinations and track how they evolved over time.

6.1 Bigram Frequency Calculation
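Bigrams are counted the same way, tokenizing with `token = "ngrams"` and dropping pairs where either word is a stopword (a sketch, assuming the filtered `posts` tibble):

```r
library(dplyr)
library(tidyr)
library(tidytext)

data("stop_words")

bigram_freq <- posts |>
  filter(type != "INSTANT_SHARE", !is.na(text)) |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  separate(bigram, into = c("w1", "w2"), sep = " ") |>
  filter(!w1 %in% stop_words$word, !w2 %in% stop_words$word) |>
  unite(bigram, w1, w2, sep = " ") |>
  count(bigram, sort = TRUE) |>
  mutate(total_bigrams = sum(n),
         frequency     = n / total_bigrams)
```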

# A tibble: 20 × 4
   bigram                        n total_bigrams frequency
   <chr>                     <int>         <int>     <dbl>
 1 data engineering             14          1611   0.00869
 2 data science                  9          1611   0.00559
 3 real world                    9          1611   0.00559
 4 zach wilson                   9          1611   0.00559
 5 engineering bootcamp          7          1611   0.00435
 6 time series                   5          1611   0.00310
 7 blog post                     4          1611   0.00248
 8 business analytics            4          1611   0.00248
 9 chad birger                   4          1611   0.00248
10 data professionals            4          1611   0.00248
11 game changer                  4          1611   0.00248
12 series forecasting            4          1611   0.00248
13 technical skills              4          1611   0.00248
14 beacom school                 3          1611   0.00186
15 data analyst                  3          1611   0.00186
16 data analytics                3          1611   0.00186
17 data modeling                 3          1611   0.00186
18 data warehouse                3          1611   0.00186
19 dataengineering analytics     3          1611   0.00186
20 doordash delivery             3          1611   0.00186

6.2 Bigram Visualization

The bootcamp with Zach Wilson sparked my build-in-public habit and it shows up clearly. “Data engineering” is the dominant phrase, followed closely by “zach wilson,” “engineering bootcamp,” and “real world,” which frames my posts around practical work rather than theory. The next cluster points to my focus areas in data science. “Time series,” “series forecasting,” and “technical skills” highlight the topics I study and share. Mentions of “data professionals,” “business analytics,” and “data analyst” reflect how I position myself in the community, while “blog post” captures the way I document progress. References to “Chad Birger” and “Beacom School” anchor the story in my USD network and mentors.

7 Wordcloud Comparison: Unigrams vs Bigrams

I’ll create mirrored wordclouds to visually compare my most frequent individual words against my most frequent two-word phrases, providing an intuitive way to see the difference between single terms and professional phrases.

7.1 Unigram Wordcloud

7.2 Bigram Wordcloud

7.3 Side-by-Side Comparison
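The mirrored clouds can be sketched with the base `wordcloud` package, assuming the `word_freq` and `bigram_freq` tables from the previous sections:

```r
library(wordcloud)

# Two panels: single words on the left, two-word phrases on the right.
par(mfrow = c(1, 2))
with(word_freq,   wordcloud(word,   n, max.words = 50, scale = c(3, 0.5)))
with(bigram_freq, wordcloud(bigram, n, max.words = 30, scale = c(2, 0.5)))
```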

Single words cluster around data, analytics, learning, and projects, while bigrams spotlight data engineering, real world work, Zach Wilson, and time series. Together they confirm a shift toward original, applied content built in public.

8 Changes in Word Usage Over Time

I’ll analyze how my word usage has changed over time by creating monthly time bins and tracking word frequency changes. This will help identify which terms became more or less common in my LinkedIn posts throughout the year.

8.1 Creating Time-Based Word Counts

# A tibble: 363 × 5
   time_floor          word      count time_total word_total
   <dttm>              <chr>     <int>      <int>      <int>
 1 2024-11-01 00:00:00 analysis      1         26          6
 2 2024-11-01 00:00:00 data          2         26         84
 3 2024-11-01 00:00:00 i’m           2         26         24
 4 2024-11-01 00:00:00 i’ve          2         26         21
 5 2024-11-01 00:00:00 real          1         26         26
 6 2024-11-01 00:00:00 science       1         26          9
 7 2024-11-01 00:00:00 share         2         26          9
 8 2025-01-01 00:00:00 analyst       2        185          8
 9 2025-01-01 00:00:00 analytics     2        185         32
10 2025-01-01 00:00:00 apply         1        185          8
# ℹ 353 more rows

8.2 Creating Nested Data for Statistical Analysis

# A tibble: 89 × 2
   word      data            
   <chr>     <list>          
 1 analysis  <tibble [5 × 4]>
 2 data      <tibble [8 × 4]>
 3 i’m       <tibble [7 × 4]>
 4 i’ve      <tibble [5 × 4]>
 5 real      <tibble [7 × 4]>
 6 science   <tibble [4 × 4]>
 7 share     <tibble [4 × 4]>
 8 analyst   <tibble [3 × 4]>
 9 analytics <tibble [7 × 4]>
10 apply     <tibble [4 × 4]>
# ℹ 79 more rows

8.3 Fitting Logistic Regression Models

I’ll fit logistic regression models to check if each word becomes more or less common over time. A positive slope means usage is increasing, while a negative slope means usage is decreasing.
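Following the standard tidytext approach, each word's monthly count out of that month's total is modeled as a binomial proportion against time (a sketch, assuming the `words_by_time` frame; the p-value adjustment method is an assumption, since the output does not name one):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

slopes <- words_by_time |>
  nest(data = -word) |>
  mutate(
    # cbind(successes, failures) ~ time: the slope on time_floor is the trend.
    model  = map(data, ~ glm(cbind(count, time_total - count) ~ time_floor,
                             data = .x, family = "binomial")),
    tidied = map(model, tidy)
  ) |>
  unnest(tidied) |>
  filter(term == "time_floor") |>
  mutate(adjusted.p.value = p.adjust(p.value)) |>  # Holm correction by default
  arrange(adjusted.p.value)
```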

8.4 Extracting Slopes and Statistical Significance

8.5 Identifying Significant Word Changes

# A tibble: 0 × 6
# ℹ 6 variables: word <chr>, data <list>, term <chr>, estimate <dbl>,
#   std.error <dbl>, adjusted.p.value <dbl>

8.6 Visualizing Word Usage Changes Over Time

I put every post on a calendar, month by month, and watched the vocabulary move. For each month I counted how often each word showed up, then asked a simple question with a logistic model: is this word gaining ground or losing it as time passes? Since I tested many words, I adjusted the p-values for multiple comparisons; after that correction no single word cleared the significance threshold (the empty table above), so the trends below should be read as suggestive rather than statistically significant. Finally I drew the paths so the shifts are easy to see.

The plot shows the arc. “Data” stays the anchor, with clear surges in March and again in July. “Learning” blooms in May, when I leaned into teaching posts. “Analytics” climbs through spring, cools in early summer, then turns up again in August. “Business” drifts downward into the summer months. “Real” holds a steady, lower baseline. Put together, it reads like a move from certificates and announcements to applied, teach-in-public content that my audience sticks with.

9 Engagement Analysis: Word Impact on Performance

I’ll analyze how specific words and phrases in my LinkedIn posts correlate with engagement metrics to understand what content drives the most reactions and reposts.

9.1 Question 1: Which words drive the most reactions?

To identify which individual words in my posts generate the highest average reactions, I’ll analyze word-level engagement metrics.
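A sketch of the word-level reaction averages, assuming a `posts_with_metrics` table that joins a per-post `reactions` column onto the posts by `urn` (the engagement columns are not shown in the Section 2 preview); the 3-post floor matches the note under the shares table:

```r
library(dplyr)
library(tidytext)

data("stop_words")

word_reactions <- posts_with_metrics |>
  filter(type != "INSTANT_SHARE", !is.na(text)) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  distinct(urn, word, reactions) |>       # count each word once per post
  group_by(word) |>
  summarise(avg_reactions   = mean(reactions),
            total_posts     = n(),
            total_reactions = sum(reactions)) |>
  filter(total_posts >= 3) |>             # only words used in 3+ posts
  arrange(desc(avg_reactions))
```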

# A tibble: 10 × 4
   word         avg_reactions total_posts total_reactions
   <chr>                <dbl>       <int>           <dbl>
 1 carousel              99             3             297
 2 engineer              97.7           3             293
 3 nba                   91.7           3             275
 4 logic                 89.3           3             268
 5 based                 84.8           5             424
 6 patterns              70.9           7             496
 7 foundational          63.3           3             190
 8 career                62.3           3             187
 9 learnings             62             3             186
10 queries               59.8           4             239

Posts that mention carousel, engineer, and nba draw the highest average reactions. This fits my content mix. Most of my teaching posts are carousels, and my engineering-focused updates from the DE bootcamp resonate with my audience. The nba spike reflects a breakout post that tied analytics to sports. Terms like logic, patterns, and foundational also index well, which signals that practical, concept-first teaching lands.

9.2 Question 2: Which words drive the most shares?

To understand which words encourage my audience to share my content, I’ll analyze word-level sharing metrics.

# A tibble: 10 × 4
   word         avg_shares total_posts total_shares
   <chr>             <dbl>       <int>        <dbl>
 1 carousel           3.67           3           11
 2 engineer           3.33           3           10
 3 logic              3.33           3           10
 4 nba                3.33           3           10
 5 based              3              5           15
 6 learnings          2.67           3            8
 7 patterns           2.57           7           18
 8 career             2              3            6
 9 foundational       2              3            6
10 helping            2              3            6

Shares tell the same story.

Posts mentioning carousel, nba, and engineer get shared the most. Concept-first terms like logic, patterns, and foundational also travel well, and words like helping and career hint that practical, pay-it-forward content is what people forward to peers. (Averages are for words used in 3+ posts.)

9.3 Question 3: Which bigrams drive the most reactions?

To identify which two-word phrases generate the highest engagement, I’ll analyze bigram-level reaction metrics.

# A tibble: 10 × 4
   bigram                    avg_reactions total_posts total_reactions
   <chr>                             <dbl>       <int>           <dbl>
 1 data engineer                     128             2             256
 2 dataexpert.io zach                 89.5           2             179
 3 medium blog                        89.5           2             179
 4 data quality                       77             2             154
 5 blog post                          56.5           4             226
 6 engineering bootcamp               45.7           7             320
 7 zach wilson                        44.4           9             400
 8 community edition                  42.5           2              85
 9 dataengineering analytics          42             3             126
10 data engineering                   41.5          14             581

Reactions jump when the post ties to identity, teaching, and community. “Data engineer” sits at the top, which means career and role-focused updates land. “Medium blog” and “blog post” show that packaged lessons get strong pickup. Mentions of “dataexpert.io zach,” “engineering bootcamp,” and “community edition” confirm that community-led learning drives interest. “Data quality” is the standout technical theme with high average reactions.

9.4 Question 4: Which bigrams drive the most shares?

To understand which two-word phrases encourage sharing, I’ll analyze bigram-level sharing metrics.

# A tibble: 10 × 4
   bigram               avg_shares total_posts total_shares
   <chr>                     <dbl>       <int>        <dbl>
 1 data engineer              5              2           10
 2 dataexpert.io zach         3              2            6
 3 medium blog                3              2            6
 4 data quality               2.5            2            5
 5 blog post                  2.25           4            9
 6 grad school                2              2            4
 7 engineering bootcamp       1.57           7           11
 8 business analyst           1.5            2            3
 9 canyon ranch               1.5            2            3
10 personal brand             1.5            2            3

Shares tell the same story. “Data engineer,” “dataexpert.io zach,” and “medium blog” lead again, mirroring the reaction rankings, while “grad school” and “personal brand” hint that career-narrative posts also travel.

9.5 Question 5: How has mentioning “Zach Wilson” impacted my reach and engagement?

To analyze the specific impact of mentioning Zach Wilson on my posts’ performance, I’ll compare engagement metrics between posts with and without this mention.
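The comparison reduces to a mention flag plus a grouped summary (a sketch; `posts_with_metrics` and its `impressions`, `reactions`, and `shares` columns are assumptions, as the engagement fields are not shown in the Section 2 preview):

```r
library(dplyr)
library(stringr)

zach_comparison <- posts_with_metrics |>
  filter(type != "INSTANT_SHARE") |>
  mutate(mentions_zach = if_else(
    str_detect(str_to_lower(coalesce(text, "")), "zach wilson"),
    "Mentions Zach Wilson", "No Mention")) |>
  group_by(mentions_zach) |>
  summarise(avg_impressions = mean(impressions),
            avg_reactions   = mean(reactions),
            avg_shares      = mean(shares),
            total_posts     = n())
```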

# A tibble: 2 × 5
  mentions_zach        avg_impressions avg_reactions avg_shares total_posts
  <chr>                          <dbl>         <dbl>      <dbl>       <int>
1 Mentions Zach Wilson           4870.          44.4      1.33            9
2 No Mention                      817.          14.8      0.268          56

Mentioning Zach in my posts had a clear, measurable effect on reach and engagement. Posts that included his name averaged roughly six times the impressions (4,870 vs. 817) and three times the reactions (44.4 vs. 14.8) of my other posts, suggesting strong network effects from his audience and his influence as a top 1% LinkedIn creator. The lift was most pronounced when combined with high-value formats like teaching carousels, amplifying both visibility and engagement beyond my usual baseline.

10 Summary of Key Findings

This read of my LinkedIn year shows a clear shift from announcements to original, teach-in-public content. Images, text, and document carousels became the core. Vocabulary moved with that shift. “Data” stayed the anchor, “learning” spiked during teaching months, and “analytics” trended up again late summer. Engagement follows the same pattern. Carousels and engineering topics travel well, and posts tied to the DE bootcamp or Zach Wilson reach more people, likely due to network effects.

10.1 What worked

  • Teaching carousels with clear takeaways.

  • Role and craft language, for example data engineering, data quality, time series.

  • Community touchpoints, mentors, and program references.